71 research outputs found

    A comparative Genome-Wide Association Interaction study using BOOST and MB-MDR algorithms on Ankylosing Spondylitis

    Genome-Wide Association (GWA) studies have gained popularity since the completion of the Human Genome Project and the advancement of high-throughput technologies. These studies aim to scan thousands of genomic variations (e.g., SNPs) for association with phenotypic variables (i.e. traits), such as disease-related phenotypes, in the hope of extracting biologically and clinically relevant information. Understanding the genetic, environmental and other components of a disease brings key insights into disease pathology and moves us closer to the ultimate goal of personalized medicine. In this work we rely on a minimal Genome-Wide Association Interaction (GWAI) protocol for genome-wide epistasis detection using SNPs, as developed in our lab [6][9]. Using the advanced non-parametric Model-Based Multifactor Dimensionality Reduction (MB-MDR) method [1] and the BOolean Operation-based Screening and Testing (BOOST) algorithm [4][*] to detect statistically significant epistatic SNP-SNP interactions, we investigate the effect of exhaustive (BOOST) and non-exhaustive (MB-MDR) marker processing strategies, LD effects, and different adjustment schemes for lower-order effects (i.e. main effects). Our approach was tested on Ankylosing Spondylitis (AS) data provided by the WTCCC2 consortium [1]. AS is a chronic disease characterized by inflammation of the joints between the spinal bones. Non-steroidal anti-inflammatory drugs, which calm the inflammatory responses of the immune system, are used as a treatment, but there is no permanent cure for AS. The disease also has a strong environmental component and affects 3.5 - 13 per 1,000 people in the USA [5

    A cautionary note on the impact of protocol changes for Genome-Wide Association SNP x SNP Interaction studies: an example on ankylosing spondylitis

    Genome-wide association interaction (GWAI) studies have increased in popularity. Yet to date, no standard protocol exists. In practice, any GWAI workflow involves making choices about the quality control strategy, SNP filtering, linkage disequilibrium (LD) pruning, and the analytic tool used to model or test for genetic interactions. Each of these choices can affect the final epistasis findings and their reproducibility in follow-up analyses. Choosing an analytic tool is not straightforward, as several such tools exist and current understanding of their performance is often based on very particular simulation settings. In the present study, we wish to raise awareness of the impact that (minor) changes in a GWAI analysis protocol can have on final epistasis findings. In particular, we investigate the influence of marker selection and marker prioritization strategies, LD pruning and the choice of epistasis detection analytics on study results, giving rise to 8 GWAI protocols. The results are discussed in the context of the ankylosing spondylitis (AS) data obtained via the Wellcome Trust Case Control Consortium (WTCCC2). As expected, the largest impact on AS epistasis findings is caused by the choice of marker selection criterion, followed by marker coding and LD pruning. In MB-MDR, co-dominant coding of main effects is more robust to the effects of LD pruning than additive coding. We were able to reproduce the previously reported epistatic involvement of HLA-B and ERAP1 in AS pathology. In addition, our results suggest involvement of MAGI3 and PARK2, which are responsible for cell adhesion and cellular trafficking. Gene Ontology (GO) biological function enrichment analysis across the 8 considered GWAI protocols also suggested that AS could be associated with Central Nervous System (CNS) malfunctions, specifically in nerve impulse propagation and in neurotransmitter metabolic processes.

    Integration of Gene Expression and Methylation to unravel Biological Networks in Glioblastoma Patients

    Peer reviewed. The vast amount of heterogeneous omics data, encompassing a broad range of biomolecular information, requires novel methods of analysis, including those that integrate the available levels of information. In this work we describe Regression2Net, a computational approach that integrates gene expression and genomic or methylome data in two steps. First, penalized regressions are used to build Expression-Expression (EEnet) and Expression-Genome or –Methylome (EMnet) networks. Second, network theory is used to highlight important communities of genes. When applying Regression2Net to gene expression and methylation profiles of individuals with glioblastoma multiforme, we identified 284 and 447 potentially interesting genes, respectively, in relation to glioblastoma pathology. These genes showed at least one connection in the integrated networks ANDnet and XORnet derived from the aforementioned EEnet and EMnet networks. Whereas the edges in ANDnet occur in both EEnet and EMnet, the edges in XORnet occur in EMnet but not in EEnet. In-depth biological analysis of connected genes in ANDnet and XORnet revealed genes that are related to energy metabolism, cell cycle control (AATF), immune system response and several cancer types. Importantly, we observed significant over-representation of cancer-related pathways including glioma, especially in the XORnet network, suggesting a non-ignorable role of methylation in glioblastoma multiforme. In the ANDnet, we furthermore identified the potential glioma suppressor genes ACCN3 and ACCN4 linked to the NBPF1 neuroblastoma breakpoint family, as well as numerous ABC transporter genes (ABCA1, ABCB1), suggesting drug resistance of glioblastoma tumors.
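    The ANDnet and XORnet definitions above map directly onto set operations on edge lists. The following is a minimal sketch under stated assumptions, not the Regression2Net implementation: edges are represented as (gene, gene) tuples, the function names are hypothetical, and the gene labels G1 to G4 are placeholders rather than results from the study.

```python
def integrate_networks(eenet_edges, emnet_edges):
    """Derive ANDnet and XORnet from two edge sets.

    ANDnet: edges present in both EEnet and EMnet.
    XORnet: edges present in EMnet but not in EEnet
            (as defined in the abstract, i.e. a set difference).
    """
    eenet = set(eenet_edges)
    emnet = set(emnet_edges)
    andnet = eenet & emnet
    xornet = emnet - eenet
    return andnet, xornet


def connected_genes(edges):
    """Return every gene that has at least one connection in the network."""
    return set().union(*edges) if edges else set()


# Placeholder edges (hypothetical gene labels, for illustration only).
eenet = {("G1", "G2"), ("G2", "G3")}
emnet = {("G1", "G2"), ("G3", "G4")}

andnet, xornet = integrate_networks(eenet, emnet)
and_genes = connected_genes(andnet)
xor_genes = connected_genes(xornet)
```

    Note that, despite its name, XORnet as described here is a one-sided difference (EMnet minus EEnet) rather than a symmetric difference, so edges found only in EEnet are not included.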

    Genome-wide environmental interaction analysis using multidimensional data reduction principles to identify asthma pharmacogenetic loci in relation to corticosteroid therapy

    Genome-wide gene-environment (GxE) and gene-gene (GxG) interaction studies share many challenges because of the common genetic component they involve. Genome-wide environmental interaction (GWEI) studies may therefore benefit from the abundance of methodologies available for genome-wide epistasis detection. One of these is Model-Based Multifactor Dimensionality Reduction (MB-MDR), which does not make any assumption about the genetic inheritance model. MB-MDR involves reducing a high-dimensional GxE space to GxE factor levels that exhibit either high, low or no evidence for association with disease outcome. In contrast to logistic regression and random forests, MB-MDR can be used to detect GxE interactions in the absence of any main effects or when sample sizes are too small to model all main and GxE interaction effects. In this ongoing study, we demonstrate the opportunities and challenges of MB-MDR for genome-wide GxE interaction analysis and analyze the difference in prebronchodilator FEV1 following 8 weeks of inhaled corticosteroid therapy for 565 pediatric Caucasian CAMP participants (ages 5-12) from the SHARE project.

    A reference library for Canadian invertebrates with 1.5 million barcodes, voucher specimens, and DNA samples

    The synthesis of this dataset was enabled by funding from the Canada Foundation for Innovation, from Genome Canada through Ontario Genomics, from NSERC, and from the Ontario Ministry of Research, Innovation and Science in support of the International Barcode of Life project. It was also enabled by philanthropic support from the Gordon and Betty Moore Foundation and from Ann McCain Evans and Chris Evans. The release of the data on GGBN was supported by a GGBN – Global Genome Initiative Award and we thank G. Droege, L. Loo, K. Barker, and J. Coddington for their support. Our work depended heavily on the analytical capabilities of the Barcode of Life Data Systems (BOLD, www.boldsystems.org). We also thank colleagues at the CBG for their support, including S. Adamowicz, S. Bateson, E. Berzitis, V. Breton, V. Campbell, A. Castillo, C. Christopoulos, J. Cossey, C. Gallant, J. Gleason, R. Gwiazdowski, M. Hajibabaei, R. Hanner, K. Hough, P. Janetta, A. Pawlowski, S. Pedersen, J. Robertson, D. Roes, K. Seidle, M. A. Smith, B. St. Jacques, A. Stoneham, J. Stahlhut, R. Tabone, J. Topan, S. Walker, and C. Wei. For bioblitz-related assistance, we are grateful to D. Ireland, D. Metsger, A. Guidotti, J. Quinn and other members of Bioblitz Canada and Ontario Bioblitz. For our work in Canada’s national parks, we thank S. Woodley and J. Waithaka for their lead role in organizing permits, and we thank the many Parks Canada staff who facilitated specimen collections, including M. Allen, D. Amirault-Langlais, J. Bastick, C. Belanger, C. Bergman, J.-F. Bisaillon, S. Boyle, J. Bridgland, S. Butland, L. Cabrera, R. Chapman, J. Chisholm, B. Chruszcz, D. Crossland, H. Dempsey, N. Denommee, T. Dobbie, C. Drake, J. Feltham, A. Forshner, K. Forster, S. Frey, L. Gardiner, P. Giroux, T. Golumbia, D. Guedo, N. Guujaaw, S. Hairsine, E. Hansen, C. Harpur, S. Hayes, J. Hofman, S. Irwin, B. Johnston, V. Kafa, N. Kang, P. Langan, P. Lawn, M. Mahy, D. Masse, D. Mazerolle, C. McCarthy, I. McDonald, J. 
McIntosh, C. McKillop, V. Minelga, C. Ouimet, S. Parker, N. Perry, J. Piccin, A. Promaine, P. Roy, M. Savoie, D. Sigouin, P. Sinkins, R. Sissons, C. Smith, R. Smith, H. Stewart, G. Sundbo, D. Tate, R. Tompson, E. Tremblay, Y. Troutet, K. Tulk, J. Van Wieren, C. Vance, G. Walker, D. Whitaker, C. White, R. Wissink, C. Wong, and Y. Zharikov. For our work near Canada’s ports in Vancouver, Toronto, Montreal, and Halifax, we thank R. Worcester, A. Chreston, M. Larrivee, and T. Zemlak, respectively. Many other organizations improved coverage in the reference library by providing access to specimens – they included the Canadian National Collection of Insects, Arachnids and Nematodes, Smithsonian Institution’s National Museum of Natural History, the Canadian Museum of Nature, the University of Guelph Insect Collection, the Royal British Columbia Museum, the Royal Ontario Museum, the Pacific Forestry Centre, the Northern Forestry Centre, the Lyman Entomological Museum, the Churchill Northern Studies Centre, and rare Charitable Research Reserve. We also thank the many taxonomic specialists who identified specimens, including A. Borkent, B. Brown, M. Buck, C. Carr, T. Ekrem, J. Fernandez Triana, C. Guppy, K. Heller, J. Huber, L. Jacobus, J. Kjaerandsen, J. Klimaszewski, D. Lafontaine, J-F. Landry, G. Martin, A. Nicolai, D. Porco, H. Proctor, D. Quicke, J. Savage, B. C. Schmidt, M. Sharkey, A. Smith, E. Stur, A. Tomas, J. Webb, N. Woodley, and X. Zhou. We also thank K. Kerr and T. Mason for facilitating collections at Toronto Zoo and D. Iles for servicing the trap at Wapusk National Park. This paper contributes to the University of Guelph’s Food from Thought research program supported by the Canada First Research Excellence Fund. The Barcode of Life Data System (BOLD; www.boldsystems.org) [8] was used as the primary workbench for creating, storing, analyzing, and validating the specimen and sequence records and the associated data resources [48]. 
The BOLD platform has a private, password-protected workbench for the steps from specimen data entry to data validation (see details in Data Records), and a public data portal for the release of data in various formats. The latter is accessible through an API (http://www.boldsystems.org/index.php/resources/api?type=webservices) that can also be controlled through R [75] with the package ‘bold’ [76]. Peer reviewed.

    Identification of asthma-related trans-acting epistatic eQTLs using Model-Based Multifactor Dimensionality Reduction (MB-MDR)

    Epistasis is likely to underlie most complex traits, including gene expression, yet it is very difficult to detect using standard approaches. SNPs located inside a gene coding region or in its vicinity (i.e. ≤2 Mb from its 5’ and 3’ ends) can influence the corresponding gene expression levels. These expression quantitative trait loci (eQTLs) are referred to as cisSNPs. In contrast, eQTLs that lie outside this range can also influence the gene’s expression, in which case they are called transSNPs with respect to that gene. In this study we considered significant cisSNPs previously identified via generalized least squares (GLS) regression modeling. We then aimed to identify gene transcripts whose expression is regulated by cis/transSNP interactions, using Model-Based Multifactor Dimensionality Reduction (MB-MDR) [2]. This model-free approach to detecting trans-epistasis involves reducing a high-dimensional GxG space to GxG factor levels that exhibit either high evidence, low evidence or no evidence at all for association with the gene expression levels of interest. Our protocol was applied to real-life data from the Childhood Asthma Management Program (CAMP) [1]. It involved coupling a traditional a priori eQTL search to an a posteriori trans-epistasis analysis to identify genetic modifiers of statistically significant cisSNPs. Such an approach can reveal previously unreported inter-dependencies that may be important for understanding the biological mechanisms underlying complex human diseases such as asthma. The proposed protocol identified a large number of trans-epistasis gene-gene effects of eQTLs.
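    The cis/trans distinction described above can be illustrated with a small classifier. This is a sketch under assumptions, not the study's pipeline: the 2 Mb window comes from the abstract, while the `Gene` record, the `classify_snp` function and the coordinates are hypothetical, and a SNP on a different chromosome is assumed to count as trans.

```python
from dataclasses import dataclass

# Window taken from the abstract: within 2 Mb of the gene's 5' or 3' end.
CIS_WINDOW_BP = 2_000_000


@dataclass(frozen=True)
class Gene:
    chrom: str
    start: int  # base-pair position of the gene start
    end: int    # base-pair position of the gene end


def classify_snp(snp_chrom, snp_pos, gene, window=CIS_WINDOW_BP):
    """Label a SNP as 'cis' or 'trans' relative to one gene."""
    same_chrom = snp_chrom == gene.chrom
    in_window = gene.start - window <= snp_pos <= gene.end + window
    return "cis" if same_chrom and in_window else "trans"


# Hypothetical gene and SNP positions, for illustration only.
gene = Gene(chrom="1", start=5_000_000, end=5_100_000)
inside_gene = classify_snp("1", 5_050_000, gene)
window_edge = classify_snp("1", 3_000_000, gene)
just_outside = classify_snp("1", 2_999_999, gene)
other_chrom = classify_snp("2", 5_050_000, gene)
```

    With these placeholder coordinates, the first two SNPs fall inside the window and are labeled cis, while the last two are labeled trans.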

    In silico study of the interaction of the Myelin Basic Protein C-terminal α-helical peptide with DMPC and mixed DMPC/DMPE lipid bilayers

    Biological membranes continue to be extensively investigated in different ways. This paper presents the benefits of Molecular Dynamics (MD) approaches for studying the properties of biological membranes and proteins using the freely available GROMACS package, in the context of the Myelin Basic Protein (MBP) C-terminal α-helical peptide. A mixed membrane consisting of 1,2-Dimyristoyl-sn-Glycero-3-phosphocholine/1,2-Dimyristoyl-sn-Glycero-3-phosphoethanolamine (DMPC/DMPE) and a pure DMPC membrane, composed of 188 and 248 lipids, respectively, were simulated for 200 ns at 309 K. The DMPC membrane was approximately three times more fluid than the DMPC/DMPE system, with diffusion coefficients (D) of 0.0207 × 10⁻⁵ cm²/s and 0.0068 × 10⁻⁵ cm²/s, respectively. In addition, the 14-residue peptide representing the C-terminal α-helical region of murine MBP, with amino acid sequence NH2-A141YDAQGTLSKIFKL154-COOH, was simulated in both membrane systems for 200 ns. The peptide penetrated further into the DMPC bilayer than into the mixed DMPC/DMPE bilayer, potentially because of the reduced accessibility of the charged peptide amino acid side chains to the formal positive charge of the amine N atom surrounded by methyl and methylene groups in DMPC, which might have resulted in greater overall peptide mobility [3]. These findings are significant in their implication that membrane composition affects the behavior of MBP, providing further insights into myelin structure. Our preliminary results suggest that local changes in membrane composition (e.g. enrichment in DMPE molecules), as well as the electrostatic nature of the primary amino acid sequence, could cause localized denaturation or instability of external MBP α-helices, possibly augmenting the degradation of myelin in multiple sclerosis (MS) and resulting in a subsequent decrease in nerve impulse propagation efficiency.

    From Statistical to Biological Interactions via Omics Integration

    The 21st century opened a new ‘Big Data’ era in which, thanks to rapid technological advancements and the emergence of high-throughput technologies, vast amounts of unstructured omics data (e.g., transcriptomic, genomic, etc.) are generated every day. This thesis mainly focuses on solving problems related to diverse omics data integration and interaction identification tasks. Particular attention is given to extracting useful knowledge in the context of complex diseases, including pathological mechanisms, through the development of software tools and pipelines. The diseases covered include glioblastoma multiforme, asthma, and ankylosing spondylitis. Interaction detection in genomic data requires standardization of the protocols. In Chapter 3, we tested the impact of different settings in a genome-wide association interaction study (GWAIS). The settings included the marker selection strategy, LD pruning, lower-order effects adjustment, and the analytical tool. We were able to show that even small changes in each setting can have drastic impacts, which calls for careful assessment of proper settings and comparison of results across several analysis protocols. The greatest impact was attributed to the input dataset composition, highlighting the importance of the marker selection strategy and the use of prior knowledge. Expression of genes can be affected by nearby (‘cis’) or distant (‘trans’) genotypes. Thus, we developed a methodology to identify complex trans/cis regulatory mechanisms between expression and genotype data in the context of asthma (CAMP data). Significant overlap between ‘trans’ and ‘cis’ gene regulatory components related to immune and signaling pathways was clearly identified, matching asthma disease pathology. The semi-parametric Model-Based Multifactor Dimensionality Reduction (MB-MDR) method was applied for the first time in the context of an eQTL study, achieving low false discovery and family-wise error rates (FDR and FWER). 
Identification of a meaningful data structure from omics data is a pressing topic nowadays. Gene regulatory networks (GRN) conveniently summarize large amounts of data, allowing for useful knowledge generation. GRN inference is especially attractive for deciphering complex disease mechanisms, allowing biologists to formulate better hypotheses. We were able to generate GRNs from a single source (e.g., microarray expression data) using conditional inference forests (CIF), which offer more attractive features than classical Random Forests (RF), including unbiased node variable selection even in the context of highly correlated variables, which is particularly relevant in transcriptomics. The CIF methods provided attractive features and performance characteristics coupled with valuable pathological insights into type 1 diabetes.
The 21st century has opened a new ‘Big Data’ era. Thanks to rapid progress and the emergence of high-throughput technologies, vast amounts of unstructured omics data (e.g., transcriptomic, genomic, etc.) are generated every day. This thesis focuses mainly on solving problems related to interaction identification and the integration of diverse omics data. Particular attention was given to the extraction of ‘useful’ knowledge in the context of complex diseases, including pathological mechanisms, as well as to the development of software and pipelines. The diseases covered include glioblastoma multiforme, asthma and ankylosing spondylitis. Detecting interactions in genomic data requires standardization of the protocol. We tested the impact of different settings on the composition of the final results in a genome-wide association interaction study (GWAIS). 
The settings in question include the marker selection strategy, linkage disequilibrium (LD), the adjustment for lower-order (main) effects and the analysis tool chosen. We were able to show that each setting could have drastic effects, which calls for careful evaluation of the appropriate settings and comparative analysis of results across several protocols. The greatest impact was attributed to the composition of the dataset, which is linked to the marker selection strategy and the use of prior knowledge. Gene expression can be affected by nearby (‘cis’) or distant (‘trans’) genotypes. We therefore sought to identify mixed trans/cis mechanisms between expression and genotype data in the context of asthma (CAMP data). A substantial overlap exists between ‘trans’ and ‘cis’ regulatory components related to the immune system and signaling, matching asthma disease pathology. The semi-parametric Model-Based Multifactor Dimensionality Reduction (MB-MDR) method was applied for the first time in an eQTL study, achieving a low false positive rate. Finding a meaningful data structure from several heterogeneous sources of omics data is an important research topic at present. Gene regulatory networks (GRN) conveniently summarize large amounts of data, enabling the production of useful knowledge. GRN inference is particularly attractive for deciphering complex disease mechanisms, allowing biologists to formulate more accurate hypotheses. 
We were able to produce a GRN from a single source (e.g., expression microarray data) using conditional inference forests (CIF), which have more attractive features than classical Random Forests. The advantages include unbiased selection of node variables, even in the context of correlated variables, which is particularly relevant for transcriptomic data. CIF methods have attractive features and lead to good performance. These methods provide insights into the pathological mechanisms of type 1 diabetes.

    Gene Regulatory Network Inference via Conditional Inference Trees and Forests

    Trees are classical data structures that allow effective classification and prediction of responses. Owing to their versatility and high performance in classification and prediction, there are many tree-based methods, including the popular Conditional Inference Trees (CIT) and Forests (CIF), Random Forests (RF), Randomized Trees (RT), randomized C4.5, etc. In this work we assessed the performance of CIT and CIF methods in predicting gene regulatory networks (GRN) from expression data, using a reference gold standard built from the real transcriptional regulatory network of E. coli. The synthetic microarray expression data was obtained from the DREAM4 challenge. The performance of each network inference method was assessed via the Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision-Recall curve (AUPR). Our preliminary results show that CIT and CIF successfully predict directed GRNs at acceptable, although not optimal, performance rates (the best AUROC at 0.68 and AUPR at 0.13 for CIF, and the best AUROC at 0.58 and AUPR at 0.18 for CIT). Surprisingly, with the current aggregation scheme of feature importance, which prefers features with the highest number of observations, a single CIT performed better than CIFs in all 5 networks. Nevertheless, the CIFs showed an overall 10% improvement in AUROC. A single CIT has 24% and CIFs have 27% lower overall performance compared to the best performer of the DREAM4 Challenge, based on cumulative areas of the PR and ROC curves. We plan to test other feature importance aggregation techniques in a single tree and in tree ensembles in order to outperform the top DREAM4 algorithms. In addition, the effects of standardizing expression data to unit variance will be presented. In the future, the developed CIF framework will be used to perform data integration analysis of multi-omics datasets.
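    To make the two evaluation metrics concrete, the sketch below scores a ranked list of predicted edges against a gold standard. It is an illustration under assumptions rather than the DREAM4 scoring code: AUROC is computed as the fraction of (true edge, non-edge) pairs ranked correctly, AUPR is approximated by average precision (a swapped-in, commonly used estimate), and the scores and labels are made-up placeholders.

```python
def auroc(scores, labels):
    """AUROC as the probability that a true edge outscores a non-edge.

    Ties between a true edge and a non-edge count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one true edge and one non-edge")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Average precision: mean of the precision at each true edge's rank."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("need at least one true edge")
    return precision_sum / hits


# Placeholder edge confidences and gold-standard labels (1 = true edge).
edge_scores = [0.9, 0.8, 0.7, 0.6]
gold_labels = [1, 0, 1, 0]

roc_area = auroc(edge_scores, gold_labels)
pr_area = average_precision(edge_scores, gold_labels)
```

    Because true regulatory edges are rare among all possible gene pairs, the precision-recall area is usually much lower than AUROC for the same ranking, which is consistent with the gap between the two metrics reported above.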